The dataset is taken from https://wonder.cdc.gov. It covers STD infection rates in the United States from 1996 to 2011. The rate is calculated per 100,000 population. The dataset includes three diseases: Chlamydia, Gonorrhea, and Syphilis. The analysis gives an idea of the rise and fall of the different STDs in America. According to the website, there are no records for New York from 1996 to 1999. The citation is as follows:
US Department of Health and Human Services, Centers for Disease Control and Prevention, National Center for HIV, STD and TB Prevention (NCHSTP), Division of STD/HIV Prevention, Sexually Transmitted Disease Morbidity for selected STDs by age, race/ethnicity and gender 1996-2011 Archive, CDC WONDER Online Database.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
%matplotlib inline
sb.set_style('darkgrid')
#sb.set_palettet(sb.color_palette()[0])
import plotly.express as px
df = pd.read_csv('STDs by Age,Year,State and Gender, 1996-2011 Archive.csv')
df.head()
df.tail()
df.shape
df.info()
# strip whitespace, lowercase the column labels and replace spaces with underscores
df.rename(columns=lambda x: x.strip().lower().replace(" ", "_"),
inplace=True)
df.head()
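As a quick illustration of what the renaming step produces, the sketch below applies the same lambda to a toy frame. The frame `toy` and its values are made up for the example; only the three column labels come from the dataset.

```python
import pandas as pd

# Toy frame with three of the original column labels (the values are made up).
toy = pd.DataFrame({"Disease Code": [1], "STD Cases": [10], "Rate": [5.0]})

# Same normalisation as in the notebook, returning a new frame instead of using inplace=True.
toy = toy.rename(columns=lambda x: x.strip().lower().replace(" ", "_"))

print(list(toy.columns))  # ['disease_code', 'std_cases', 'rate']
```

These normalised names (`std_cases`, `rate`, and so on) are the ones used in the rest of the notebook.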
df.duplicated().sum()
df.isna().sum()
The dataset is fairly clean, containing 27053 records and 13 columns. The columns are as follows:
['Disease', 'Disease Code', 'State', 'State Code', 'Year', 'Year Code', 'Gender', 'Gender Code', 'Age', 'Age Code', 'STD Cases', 'Population', 'Rate']
• STD Cases
• Rate
• Year
• Disease
• State
• Gender
• Disease
• STD Cases
• Rate
• Year
plt.figure(figsize=(27,17))
# countplot tallies rows per state, so the bars show record counts, not case totals
sb.countplot(data=df, y='state',
color=sb.color_palette()[0], order=df['state'].value_counts().index)
plt.title('Number of records per state', fontsize=20)
plt.xlabel('count',fontsize=17)
plt.yticks(fontsize=15)
plt.xticks(fontsize=15)
plt.ylabel('State',fontsize=17);
• Louisiana, Texas and Tennessee have the most records in the dataset
• Vermont and Wyoming are the states with the fewest records
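A count plot tallies rows per category, which is not the same as totalling the reported cases. The sketch below shows the difference on a made-up frame; the state labels `A` and `B` and all numbers are hypothetical.

```python
import pandas as pd

# Made-up records: state A has more rows, state B has more cases.
toy = pd.DataFrame({
    "state": ["A", "A", "A", "B", "B"],
    "std_cases": [10, 20, 30, 500, 700],
})

# Number of rows per state: this is what a count plot draws.
rows_per_state = toy["state"].value_counts()

# Total reported cases per state: this needs an explicit sum.
cases_per_state = toy.groupby("state")["std_cases"].sum()

print(rows_per_state.idxmax())   # A
print(cases_per_state.idxmax())  # B
```

The state with the most rows need not be the state with the most cases, so the two rankings are best read separately.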
plt.figure(figsize=(10,7))
#plot
sb.countplot(data=df, x='gender',
order=df['gender'].value_counts().index, color=sb.color_palette()[0])
# setting title and labels
plt.title('Which gender suffers more from STDs?', fontsize=20)
plt.xticks(fontsize=15)
plt.xlabel('Gender',fontsize=15)
plt.yticks(fontsize=15)
plt.ylabel('Frequency',fontsize=15);
• Males account for more records than females. Because the plot counts rows rather than cases, this alone does not show that males suffer more from STDs
df['rate'].describe()
np.log10(df['rate'].describe())
plt.figure(figsize=(17,10))
plt.suptitle('Distribution of STD rates',fontsize=20)
#left histogram: data plotted in natural units
plt.subplot(1,2,1)
bins = np.arange(0,df['rate'].max() + 100,100)
plt.hist(data=df,x='rate',bins=bins);
plt.xlabel('values',fontsize=15)
plt.xticks(fontsize=15);
plt.yticks(fontsize=15);
#plt.title('Distribution of STD rates',fontsize=20)
#right histogram: data plotted after log transformation
plt.subplot(1,2,2)
bins = 10 ** np.arange(-1,4 + 0.1,0.1)
ticks = [0.1, 1, 3, 10, 30, 100, 300, 1000,5000]
plt.hist(df['rate'],bins=bins);
plt.xscale('log');
plt.xticks(ticks,ticks,fontsize=15)
plt.xlabel('log(values)',fontsize=15);
plt.yticks(fontsize=15);
• On the linear scale, the few data points with values above 500 squeeze the majority of the points into the bins on the far left.
• This indicates that an axis transformation is needed to visualize the distribution of STD rates.
• The logarithmic scale shows that the data is roughly multimodal, with a first peak somewhere between 3 and 10, a second peak around 100, and the largest peak around 300. The rate distribution cuts off at its maximum rather than declining in a smooth tail.
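The effect of the two binning schemes can be checked without plotting. The following is a minimal sketch on synthetic, right-skewed data; the lognormal sample `values` is an assumption standing in for the rate column and is not the CDC data.

```python
import numpy as np

# Synthetic, right-skewed values standing in for the rate column (not real data).
rng = np.random.default_rng(0)
values = rng.lognormal(mean=4.0, sigma=1.5, size=1000)

# Linear bins of width 100 versus bins evenly spaced in log10 space.
linear_bins = np.arange(0, values.max() + 100, 100)
log_bins = 10 ** np.arange(-1, np.log10(values.max()) + 0.1, 0.1)

linear_counts, _ = np.histogram(values, bins=linear_bins)
log_counts, _ = np.histogram(values, bins=log_bins)

# Share of all points that land in the single fullest bin under each scheme.
print(linear_counts.max() / values.size)
print(log_counts.max() / values.size)
```

With linear bins most of the sample falls into one bin, while log-spaced bins spread the same points over many bins, which is why the shape only becomes visible after the transformation.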
case_counts = df['std_cases'].value_counts()
case_counts.index
# total the numeric columns per state; numeric_only skips the text columns
uni_plt = df.groupby('state').sum(numeric_only=True)
uni_plt.sort_values('std_cases', ascending=False, inplace=True)
uni_plt = uni_plt.reset_index()
uni_plt.head(10)
plt.figure(figsize=(17,10))
# plot the per-state totals from uni_plt; passing df would draw the mean per record instead
sb.barplot(data=uni_plt, x='state', y='std_cases', color=sb.color_palette()[0])
plt.title('Number of STD Cases in the States')
plt.xlabel('State')
plt.xticks(rotation=90)
plt.ylabel('STD Cases');
• Texas and California have the highest number of STD Cases
• Vermont and Wyoming have the lowest number of STD Cases
df['rate'].describe()
plt.figure(figsize=(17,10))
sb.barplot(data=df, x='state',y='rate',color=sb.color_palette()[0])
plt.title('STD rates in the States')
plt.xlabel('State')
plt.xticks(rotation=90)
plt.ylabel('Rate');
• Alaska and DC recorded higher STD rates than the other states
• New Jersey had the lowest rate over the period ending in 2011
Based on the bivariate plots, something looks wrong with the rate column. One way to examine it is through the standard deviation of the rate column. Let's compute it per year and disease with groupby() and store the result in a new data frame named 'actual_rate1'.
# numeric_only avoids errors from the text columns in recent pandas versions
actual_rate1 = df.groupby(['year','disease']).std(numeric_only=True)
actual_rate1.dropna(inplace=True)
actual_rate1 = actual_rate1.reset_index()
actual_rate1
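The standard deviation measures the spread of the rate within each group, not the rate itself. As an alternative check, the rate can be recomputed directly from cases and population. The sketch below assumes the rate is defined as cases divided by population, times 100,000; the frame `toy` and the column name `rate_recomputed` are made up for illustration.

```python
import pandas as pd

# Made-up rows using the renamed column labels from this notebook.
toy = pd.DataFrame({
    "year": [1996, 1996, 1997, 1997],
    "disease": ["Chlamydia"] * 4,
    "std_cases": [50, 150, 100, 200],
    "population": [100_000, 100_000, 100_000, 100_000],
})

# Sum cases and population within each group, then derive the rate per 100,000.
grouped = toy.groupby(["year", "disease"], as_index=False)[["std_cases", "population"]].sum()
grouped["rate_recomputed"] = grouped["std_cases"] * 100_000 / grouped["population"]

print(grouped)
```

Summing cases and population before dividing weights each group by its population, which averaging the per-row rates would not do.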
plt.figure(figsize=(10,7))
sb.lineplot(x='year',y='rate',hue='disease',data=actual_rate1);
plt.title('Overall trend of STDs in US since 1996');
fig = px.line(actual_rate1, x='year', y='rate',color='disease')
fig.update_layout(
title = "Overall trend of STDs in US since 1996"
)
fig.show()
• Overall, Chlamydia shows a steady increase, reaching a peak of over 900
• Gonorrhea showed a steady decrease by the end of 2010
• Syphilis, on the other hand, stayed steady over the whole period.
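Trend statements like these can be checked numerically with a year-over-year difference. The following is a minimal sketch on a made-up series; the values in `trend` are illustrative only and are not taken from the CDC data.

```python
import pandas as pd

# Made-up yearly rates for one disease (illustrative only, not the CDC figures).
trend = pd.Series(
    [120.0, 118.0, 125.0, 131.0, 129.0, 135.0],
    index=[1998, 1999, 2000, 2001, 2002, 2003],
    name="rate",
)

# Year-over-year change; the first entry is NaN because it has no predecessor.
change = trend.diff()

# Years in which the rate fell relative to the previous year.
drop_years = change[change < 0].index.tolist()
print(drop_years)  # [1999, 2002]
```

Applied to one disease's column of a year-by-disease table, the first element of such a list would give the first year in which that disease declined.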
fig = px.histogram(df, x='age', y='rate',color='disease')
fig.update_layout(
title = "Spread of STDs across age groups"
)
fig.show()
• The teenage years showed some of the highest STD rates
• Among the STDs, Chlamydia was the most prevalent in most age groups, while Syphilis was the least
• When visualizing the STD rate, the initial histogram was heavily skewed, with most values bunched on the far left.
• This was corrected with a log scale transformation, which gave a clear picture of the distribution of the STD rates
• In the bivariate plot of STD rate vs. state, the rate behaved unexpectedly. An attempt was made to fix the rate column in the multivariate plots, where it made sense when combined with other features
The following conclusions can be drawn from the visualizations above:
• Texas has the highest number of STD Cases
• Texas and California reported the largest rise in STDs
• Overall, Chlamydia has increased over the years
• The first drop for Gonorrhea was in year 2001
• The rates across the age groups are unimodal, with the 20-24 age group having the highest rate of STDs
• From the multivariate plots, it can be seen that Chlamydia was the most common of the three STDs
• States with low STD cases, like Vermont and Wyoming, also have the lowest rates. This suggests a proportional relationship between STD cases and rate. For the states with the highest STD cases, however, this relationship does not hold. Thus, further analysis is required.
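One possible starting point for that further analysis is a rank correlation between per-state case totals and rates. The sketch below uses made-up per-state figures; the state labels, case counts and populations are all hypothetical. It computes the Spearman correlation as the Pearson correlation of the ranks.

```python
import pandas as pd

# Made-up per-state totals (illustrative only).
states = pd.DataFrame({
    "state": ["A", "B", "C", "D", "E"],
    "std_cases": [1_000, 5_000, 20_000, 80_000, 120_000],
    "population": [500_000, 1_000_000, 4_000_000, 25_000_000, 30_000_000],
})

# Rate per 100,000, assuming rate = cases / population * 100,000.
states["rate"] = states["std_cases"] * 100_000 / states["population"]

# Spearman correlation computed as the Pearson correlation of the ranks.
rho = states["std_cases"].rank().corr(states["rate"].rank())
print(round(rho, 2))
```

A value near 1 would support a monotonic relationship between cases and rate, while a value near 0, as in this toy example, would indicate that large case totals mostly reflect large populations.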